Skip to content

Add temporality awareness to openrag responses and chunk creation#130

Closed
EnjoyBacon7 wants to merge 15 commits intodevfrom
temporality
Closed

Add temporality awareness to openrag responses and chunk creation#130
EnjoyBacon7 wants to merge 15 commits intodevfrom
temporality

Conversation

@EnjoyBacon7
Copy link
Collaborator

@EnjoyBacon7 EnjoyBacon7 commented Nov 4, 2025

API changes:

  • OpenRAG now optionally expects "datetime" as as file metadata.
  • Profiding "modified_at" and "created_at" in indexing requests overrides file-provided created and modified times
  • "indexed_at" is ignored when provided in indexing request (it is overridden by the current date)

Summary by CodeRabbit

  • New Features

    • Temporal filtering for searches (created/modified/indexed/document dates) with automatic date extraction and configurable temporal weighting/decay
    • Temporal-aware reranking prioritizes recent results; search results surface temporal metadata and current-system timestamp in prompts
  • Documentation

    • Expanded API and guides with temporal filtering examples and usage patterns
  • Chores

    • Docker images updated to latest tags
  • Bug Fixes

    • Timezone-aware handling of file timestamps for accurate metadata

✏️ Tip: You can customize this high-level summary in your review settings.

EnjoyBacon7 and others added 6 commits October 22, 2025 16:51
- openrag/utils/temporal.py
TemporalQueryNormalizer class extracts temporal filters
Date patterns recognition in multiple languages
Relative time Extraction

- openrag/components/indexer/chunker/chunker.py
Chunkers now add an indexed_at timestamp to documents
It is expected that indexed documents provide a created_at timestamp if available

- openrag/components/indexer/vectordb/vectordb.py
Milvus schema updated to include created_at and indexed_at fields
Added Temporal filtering support in vector database queries

- openrag/components/retriever.py & pipeline.py
Added temporal_filter parameter to all retrievers
Automatic temporal extraction from queries via TemporalQueryNormalizer
Injects current UTC datetime into system prompt

- openrag/components/reranker.py
Reranker now combines relevance and temporal scores using a linear decay formula

- RERANKER_TEMPORAL_WEIGHT (default 0.3)
- RERANKER_TEMPORAL_DECAY_DAYS (default 365)
Added extraction for "modified_at" field in indexation
Added "datetime" metadata field as preferred field for date information
Added formatted prompt logging in DEBUG mode
Fixed db search with date filters to use OR logic between date fields
@EnjoyBacon7 EnjoyBacon7 marked this pull request as draft November 4, 2025 16:00
@dodekapod dodekapod self-requested a review November 13, 2025 10:52
@paultranvan
Copy link
Collaborator

@coderabbitai review

@coderabbitai
Copy link

coderabbitai bot commented Dec 15, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai
Copy link

coderabbitai bot commented Dec 15, 2025

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

This pull request introduces comprehensive temporal awareness capabilities across the system. Changes include adding temporal metadata fields (datetime, created_at, modified_at, indexed_at) to document chunks and vector database schema, implementing temporal filtering logic in retrieval queries with OR-based logic across temporal fields, and integrating temporal scoring into reranking using configurable weight and decay parameters. A new TemporalQueryNormalizer utility extracts temporal expressions from queries, while configuration, documentation, and system prompts are updated to support these capabilities. Retriever interfaces are extended with optional temporal_filter parameters, and Docker image references are updated to latest tags.

Possibly related PRs

  • Feat/chunking #165: Overlaps with chunk metadata/timestamp additions and chunking refactor that modify how chunk metadata is constructed and stored.

Suggested labels

feat

Suggested reviewers

  • dodekapod

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 5.13% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title accurately summarizes the main changes: adding temporal awareness to document chunks (via indexed_at timestamps) and responses (via temporal filtering, scoring, and context injection).

Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (3)
prompts/example1/sys_prompt_tmpl.txt (1)

21-26: Temporal awareness rules are well-structured.

The new Rule 4 provides clear guidance for handling temporal context. The instructions cover document metadata usage, prioritization of recent documents, conflict resolution, and recency communication.

Note: A past review suggested simplifying line 24 to reduce redundancy. The current version is more explicit but slightly verbose. Consider if the additional detail is necessary.

openrag/utils/temporal.py (2)

40-50: Consider renaming english_patterns to clarify its purpose.

As noted in a previous review, the variable name english_patterns could be misleading since it includes patterns for multiple languages (French, German, Spanish, Italian, Portuguese). A name like multilingual_fallback_patterns or common_language_patterns would better describe its purpose.

-        # English patterns for backward compatibility
-        self.english_patterns = {
+        # Multilingual fallback patterns for common temporal expressions
+        self.multilingual_patterns = {
             r'\b(today|aujourd\'hui|heute|hoy|oggi|hoje)\b': lambda: self._get_today(),
             r'\b(yesterday|hier|ayer|ieri|ontem)\b': lambda: self._get_yesterday(),
             r'\b(last|past|recent)\b': lambda: self._get_last_n_days(30),
         }

Note: Update references in extract_temporal_filter accordingly.


174-210: High false-positive risk in relative time extraction - previously flagged.

As noted in a previous review, this heuristic is prone to false positives. Queries like "summarize documents about 7 Eleven acquisition" or "5 star reviews" will incorrectly trigger temporal filtering for 7 or 5 days respectively.

The pattern (\d+)\s*\w+|\w+\s+(\d+) is too permissive—it matches any number adjacent to any word.

Consider:

  1. Requiring specific time-unit keywords (days, weeks, months, años, jours, etc.)
  2. Adding a confidence threshold or validation
  3. Making this extraction opt-in via configuration
🧹 Nitpick comments (12)
docs/content/docs/documentation/data_model.md (1)

109-149: Clarify temporal filter semantics (UTC + inclusivity) for better alignment with implementation

The temporal schema description is solid, but it would help to explicitly state that:

  • All stored timestamps are normalized to UTC (even if represented as +00:00 instead of Z), and
  • Whether *_after / *_before are inclusive (>=/<=) or exclusive bounds.

This avoids confusion when users try to match query behavior to the documented model and Milvus filter expressions, especially when mixing manual temporal_filter with other metadata filters.

docker-compose.yaml (1)

7-7: Prefer versioned Docker tags over :latest for reproducible deployments

Switching to linagoraai/openrag:latest and linagoraai/indexer-ui:latest is convenient for dev, but it makes production deployments non-deterministic and harder to roll back.

Consider:

  • Pinning to explicit versions (or at least a well-defined channel tag), and/or
  • Documenting that :latest is intended only for local/dev use.

Also applies to: 60-60

openrag/components/indexer/chunker/chunker.py (1)

5-5: Per‑chunk indexed_at metadata looks correct; consider DRYing and reusing a single timestamp

The new chunk_meta construction in all three splitters:

chunk_meta = {
    **metadata,
    "page": start_page,
    "indexed_at": datetime.now(timezone.utc).isoformat(),
}
filtered_chunks.append(Document(page_content=chunk_w_context, metadata=chunk_meta))
  • Correctly:
    • Preserves existing metadata,
    • Sets page to the chunk’s start page, and
    • Overrides any incoming indexed_at with the current UTC time, matching the PR’s “ignore request-provided indexed_at” requirement.

Two optional refinements:

  1. Compute a single indexed_at_ts = datetime.now(timezone.utc).isoformat() once per split_document call and reuse it for all chunks to:

    • Avoid tiny timestamp drift between chunks, and
    • Save repeated datetime.now calls.
  2. Extract a small helper (e.g., _build_chunk_meta(metadata, start_page, indexed_at_ts)) to avoid duplicating the same dict construction across the three splitters.

Also applies to: 239-244, 356-361, 461-466

openrag/components/utils.py (1)

121-156: Replace bare except in format_date with a narrower exception (and optionally reuse shared temporal utilities)

The new format_date helper is useful, but this pattern:

def format_date(iso_date: str) -> str:
    try:
        from datetime import datetime
        dt = datetime.fromisoformat(iso_date.replace('Z', '+00:00'))
        return dt.strftime("%Y-%m-%d %H:%M")
    except:
        return iso_date

has two drawbacks:

  • The bare except: will also swallow critical exceptions (e.g., KeyboardInterrupt, SystemExit), and static analysis flags it (E722).
  • Re-importing datetime on every call is unnecessary.

A safer version:

-from datetime import datetime
-
 def format_date(iso_date: str) -> str:
     """Convert ISO date to readable format: 2025-11-02 14:30"""
     try:
-        from datetime import datetime
-        dt = datetime.fromisoformat(iso_date.replace('Z', '+00:00'))
+        dt = datetime.fromisoformat(iso_date.replace("Z", "+00:00"))
         return dt.strftime("%Y-%m-%d %H:%M")
-    except:
+    except (ValueError, TypeError):
         return iso_date

…and add a single from datetime import datetime at module level if you want to avoid the local import entirely.

If you already have shared parsing/normalization logic in openrag/utils/temporal.py, you might also consider delegating to it here to keep temporal handling consistent in one place.

docs/content/docs/documentation/API.mdx (1)

222-229: Align temporal filtering examples between raw HTTP and OpenAI client usage

The temporal filtering section and examples are helpful, but there’s a subtle inconsistency:

  • The curl example correctly shows metadata.temporal_filter as part of the JSON body.
  • The first Python snippet under Temporal Filtering uses metadata={...} directly in client.chat.completions.create(...).
  • The later “Example with Temporal Filtering” correctly uses extra_body={"metadata": {"temporal_filter": {...}}} for the OpenAI client.

To reduce confusion for users:

  • Either update the earlier Python example to also use extra_body={"metadata": {...}}, or
  • Explicitly distinguish:
    • Raw HTTP: metadata is a top-level JSON field.
    • OpenAI Python client: metadata goes inside extra_body.

This keeps the temporal_filter contract clear while matching how the OpenAI client actually passes through custom fields.

Also applies to: 233-259, 301-344

docs/content/docs/documentation/features_in_details.md (1)

87-131: Temporal Awareness documentation is clear; consider adding cross‑links for discoverability

The new Temporal Awareness section accurately reflects the temporal fields, scoring behavior, and configuration knobs introduced elsewhere (priority order, env vars, etc.).

As an optional improvement, consider:

  • Linking to the API Temporal Filtering section and the Vector Database Schema (Milvus) data model section so readers can easily jump from the feature overview to request payload examples and the underlying schema.
openrag/components/reranker.py (1)

70-72: Consider logging at debug level instead of warning for common parsing scenarios.

The warning-level log for temporal score calculation errors may be noisy if documents frequently have malformed or missing date metadata. Consider using debug level, or only warn for unexpected error types.

         except Exception as e:
-            self.logger.warning(f"Error calculating temporal score: {e}")
+            self.logger.debug(f"Could not calculate temporal score, using neutral: {e}")
             return 0.5  # Neutral score on error
openrag/components/pipeline.py (3)

2-3: Remove unused json import.

The json module is imported but doesn't appear to be used in this file.

 import copy
-import json
 from datetime import datetime, timezone

46-51: Use explicit Optional type hint for temporal_filter parameter.

Per PEP 484 and the static analysis hint, implicit Optional (using = None without the type) should be explicit.

+from typing import Optional
+
+...
+
     async def retrieve_docs(
-        self, partition: list[str], query: str, use_map_reduce: bool = False, temporal_filter: dict = None
+        self, partition: list[str], query: str, use_map_reduce: bool = False, temporal_filter: Optional[dict] = None
     ) -> list[Document]:

183-186: Reconsider logging full system prompt content.

Logging the full system prompt at debug level (line 186) may expose the entire retrieved document context in logs. While useful for debugging, this could lead to large log entries and potential sensitive data exposure in production.

Consider logging only a truncated preview or removing this log in favor of the preceding log that includes prompt length and doc count.

         # Debug: log the formatted system prompt
         logger.debug("System prompt with context", prompt_length=len(system_prompt_content), doc_count=len(docs))
-        logger.debug("Full system prompt", content=system_prompt_content)
+        # Uncomment for local debugging only:
+        # logger.debug("Full system prompt", content=system_prompt_content)
openrag/utils/temporal.py (2)

62-81: Bare except clauses swallow all exceptions including system exits.

The bare except clauses here will catch KeyboardInterrupt, SystemExit, etc. While the intent is graceful fallback, consider catching specific exceptions.

     def _parse_numeric_date(self, match) -> Tuple[datetime, datetime]:
         """Parse numeric date: DD/MM/YYYY or MM/DD/YYYY"""
         # Try DD/MM/YYYY format first (more common internationally)
         try:
             day, month, year = int(match.group(1)), int(match.group(2)), int(match.group(3))
             if 1 <= month <= 12 and 1 <= day <= 31:
                 return self._get_specific_date(f"{year:04d}-{month:02d}-{day:02d}")
-        except:
+        except (ValueError, AttributeError):
             pass
         
         # Try MM/DD/YYYY format (US)
         try:
             month, day, year = int(match.group(1)), int(match.group(2)), int(match.group(3))
             if 1 <= month <= 12 and 1 <= day <= 31:
                 return self._get_specific_date(f"{year:04d}-{month:02d}-{day:02d}")
-        except:
+        except (ValueError, AttributeError):
             pass

221-256: Replace bare except with specific exception types.

The bare except clauses on lines 231 and 254 catch all exceptions. Use specific exception types for safer error handling.

         for pattern, parse_func in self.universal_patterns.items():
             match = re.search(pattern, query)
             if match:
                 try:
                     start, end = parse_func(match)
                     return {
                         'created_after': start.isoformat(),
                         'created_before': end.isoformat(),
                     }
-                except:
+                except (ValueError, AttributeError, TypeError):
                     continue
         for pattern, date_func in self.english_patterns.items():
             match = re.search(pattern, query_lower, re.IGNORECASE)
             if match:
                 try:
                     start, end = date_func()
                     return {
                         'created_after': start.isoformat(),
                         'created_before': end.isoformat(),
                     }
-                except:
+                except (ValueError, AttributeError, TypeError):
                     continue
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ed7a7b3 and 103f794.

📒 Files selected for processing (14)
  • .hydra_config/config.yaml (1 hunks)
  • docker-compose.yaml (2 hunks)
  • docs/content/docs/documentation/API.mdx (2 hunks)
  • docs/content/docs/documentation/data_model.md (1 hunks)
  • docs/content/docs/documentation/features_in_details.md (1 hunks)
  • openrag/components/indexer/chunker/chunker.py (4 hunks)
  • openrag/components/indexer/vectordb/vectordb.py (3 hunks)
  • openrag/components/pipeline.py (5 hunks)
  • openrag/components/reranker.py (3 hunks)
  • openrag/components/retriever.py (6 hunks)
  • openrag/components/utils.py (1 hunks)
  • openrag/routers/indexer.py (3 hunks)
  • openrag/utils/temporal.py (1 hunks)
  • prompts/example1/sys_prompt_tmpl.txt (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
openrag/components/retriever.py (2)
openrag/utils/dependencies.py (1)
  • get_vectordb (46-48)
openrag/components/indexer/vectordb/vectordb.py (4)
  • async_search (69-77)
  • async_search (471-597)
  • async_multi_query_search (80-88)
  • async_multi_query_search (443-469)
openrag/components/pipeline.py (2)
openrag/utils/temporal.py (2)
  • TemporalQueryNormalizer (11-286)
  • extract_temporal_filter (212-257)
openrag/components/retriever.py (4)
  • retrieve (30-36)
  • retrieve (46-82)
  • retrieve (112-152)
  • retrieve (179-213)
🪛 GitHub Actions: Smoke test
openrag/components/indexer/vectordb/vectordb.py

[error] 438-438: Unexpected error while adding a document: insert missed the 'datetime' field with no nullable or default_value set.


[error] 438-438: Unexpected error while adding a document: insert missed the 'datetime' field with no nullable or default_value set.


[error] 438-438: Unexpected error while adding a document: insert missed the 'datetime' field with no nullable or default_value set.


[error] 438-438: Unexpected error while adding a document: insert missed the 'datetime' field with no nullable or default_value set.

🪛 Ruff (0.14.8)
openrag/components/utils.py

131-131: Do not use bare except

(E722)

openrag/utils/temporal.py

69-69: Do not use bare except

(E722)


69-70: try-except-pass detected, consider logging the exception

(S110)


77-77: Do not use bare except

(E722)


77-78: try-except-pass detected, consider logging the exception

(S110)


231-231: Do not use bare except

(E722)


231-232: try-except-continue detected, consider logging the exception

(S112)


254-254: Do not use bare except

(E722)


254-255: try-except-continue detected, consider logging the exception

(S112)

openrag/components/reranker.py

68-68: Consider moving this statement to an else block

(TRY300)


70-70: Do not catch blind exception: Exception

(BLE001)

openrag/components/pipeline.py

47-47: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)

🔇 Additional comments (13)
.hydra_config/config.yaml (1)

49-57: Temporal reranker configuration is wired consistently

The new reranker.temporal_weight and temporal_decay_days settings look consistent with the rest of the config (using oc.decode and env overrides) and match the documented temporal scoring behavior.

No changes needed here.

openrag/routers/indexer.py (1)

2-2: modified_at from request metadata is being ignored; created_at and modified_at handling is asymmetrical

In both add_file and put_file, the handling of created_at and modified_at differs:

# Use provided created_at if available, otherwise extract from file system
if "created_at" not in metadata or not metadata["created_at"]:
    metadata["created_at"] = datetime.fromtimestamp(file_stat.st_ctime, tz=timezone.utc).isoformat()

# Extract file modification time (always from file system)
metadata["modified_at"] = datetime.fromtimestamp(file_stat.st_mtime, tz=timezone.utc).isoformat()
  • created_at respects a user-provided value and only falls back to filesystem metadata.
  • modified_at is unconditionally overwritten from the filesystem, ignoring any value provided in the request.

For consistent behavior, consider applying the same conditional logic to modified_at:

-    # Extract file modification time (always from file system)
-    metadata["modified_at"] = datetime.fromtimestamp(file_stat.st_mtime, tz=timezone.utc).isoformat()
+    # Use provided modified_at if available, otherwise extract from file system
+    if "modified_at" not in metadata or not metadata["modified_at"]:
+        metadata["modified_at"] = datetime.fromtimestamp(
+            file_stat.st_mtime, tz=timezone.utc
+        ).isoformat()

Apply the same change in both add_file and put_file.

(Secondary note: on POSIX systems, st_ctime represents metadata change time rather than creation time; if strict creation time semantics are needed, this may need revisiting.)

prompts/example1/sys_prompt_tmpl.txt (1)

5-6: LGTM - Dynamic timestamp injection.

Good addition of the {current_datetime} placeholder for temporal awareness in the system prompt.

openrag/components/reranker.py (3)

17-27: LGTM - Temporal scoring parameters initialization.

Good defaults for temporal_weight (0.3) and temporal_decay_days (365), with proper logging of configuration values.


60-66: Handle future-dated documents gracefully.

If a document has a future date (e.g., scheduled content or clock skew), days_old becomes negative, and the formula 1.0 - days_old / temporal_decay_days will exceed 1.0 before being clamped by min(1.0, ...). While the clamp handles this, consider whether future-dated documents should be treated differently (e.g., capped at 1.0 or flagged).

The current implementation works correctly due to the min(1.0, ...) clamp.


99-127: LGTM - Temporal scoring integration in reranking.

The weighted combination of relevance and temporal scores is well-implemented. Storing all three scores (relevance_score, temporal_score, combined_score) in metadata provides good transparency for debugging and downstream use. Re-sorting by combined_score ensures correct ordering.

openrag/components/indexer/vectordb/vectordb.py (2)

332-355: LGTM - Temporal field indexes.

INVERTED indexes on VARCHAR temporal fields are appropriate for enabling efficient range-based filtering queries.


500-518: Temporal OR-logic may return unexpected results when only some fields are populated.

The current implementation applies the same after_value/before_value to all four temporal fields with OR logic. This means a document matches if any of its temporal fields fall within the range, even if other fields are outside the range.

For documents where only indexed_at is set (no user-provided datetime), this works as intended. However, if a document has conflicting timestamps (e.g., created_at in January but indexed_at in December), it will match queries for both time periods.

If this is intentional for maximum recall, consider adding a comment explaining the design decision.

openrag/components/pipeline.py (2)

135-157: LGTM - Temporal filter extraction flow.

Good implementation: extracts from metadata first, falls back to query analysis, logs extracted filters appropriately, and passes through to retrieval.


191-201: _prepare_for_completions doesn't support temporal filtering.

The _prepare_for_completions method calls retrieve_docs without passing temporal_filter, meaning the completions endpoint won't benefit from temporal awareness. If temporal filtering should apply to both chat and completions endpoints, consider adding similar temporal filter extraction here.

openrag/utils/temporal.py (1)

259-286: LGTM - Query augmentation implementation.

The augment_query method provides a clean way to add temporal context to queries for better retrieval. The date formatting using %B %Y produces readable output.

openrag/components/retriever.py (2)

3-3: LGTM!

The Optional import is correctly added to support the new temporal_filter parameter type hints.


30-35: LGTM!

The abstract method signature correctly defines the temporal_filter parameter as optional with appropriate type hints.

It is now a lot more conservative when interpreting dates, as such, it will less often interpret dates, but when it does, it is more accurate.
@EnjoyBacon7
Copy link
Collaborator Author

@coderabbitai review again!

@coderabbitai
Copy link

coderabbitai bot commented Dec 24, 2025

@EnjoyBacon7, I'll review the changes in this PR again.

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (2)
openrag/utils/temporal.py (2)

89-101: Consider making date format preference configurable.

The current logic always prefers DD/MM/YYYY over MM/DD/YYYY for ambiguous dates (e.g., 01/02/2024 is interpreted as 1 February rather than 2 January). This bias may surprise users in regions that use MM/DD/YYYY conventions.

🔎 Suggested approach

Add a configuration parameter to control date format preference:

-    def __init__(self):
+    def __init__(self, prefer_day_first: bool = True):
+        self.prefer_day_first = prefer_day_first
         # Ordered: higher precision first

Then update _parse_numeric_date:

     def _parse_numeric_date(self, match):
         a, b, y = map(int, match.groups())
 
-        # Prefer DD/MM/YYYY if valid
-        # Note: I have no idea how to correctly disambiguate MM/DD/YYYY vs DD/MM/YYYY
-        if 1 <= b <= 12:
-            return self._specific_date(y, b, a)
-
-        # Fallback MM/DD/YYYY
-        if 1 <= a <= 12:
-            return self._specific_date(y, a, b)
+        if self.prefer_day_first:
+            # Try DD/MM/YYYY first
+            if 1 <= b <= 12 and 1 <= a <= 31:
+                return self._specific_date(y, b, a)
+            # Fallback MM/DD/YYYY
+            if 1 <= a <= 12 and 1 <= b <= 31:
+                return self._specific_date(y, a, b)
+        else:
+            # Try MM/DD/YYYY first
+            if 1 <= a <= 12 and 1 <= b <= 31:
+                return self._specific_date(y, a, b)
+            # Fallback DD/MM/YYYY
+            if 1 <= b <= 12 and 1 <= a <= 31:
+                return self._specific_date(y, b, a)
 
         raise ValueError

155-163: Consider using word boundaries for keyword matching.

The current substring matching (if word in q) could theoretically match keywords as parts of larger words, though this is unlikely for the keywords provided.

🔎 More robust approach using word boundaries
     def _extract_keywords(self, query: str):
         q = query.lower()
         for word, offset in self.keyword_ranges.items():
-            if word in q:
+            if re.search(r'\b' + re.escape(word) + r'\b', q):
                 day = datetime.now(timezone.utc) - timedelta(days=offset)
                 start = day.replace(hour=0, minute=0, second=0, microsecond=0)
                 end = day.replace(hour=23, minute=59, second=59, microsecond=999999)
                 return start, end
         return None
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 103f794 and a09d4b5.

📒 Files selected for processing (3)
  • openrag/utils/README.md
  • openrag/utils/temporal.py
  • prompts/example1/sys_prompt_tmpl.txt
✅ Files skipped from review due to trivial changes (1)
  • openrag/utils/README.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • prompts/example1/sys_prompt_tmpl.txt
🔇 Additional comments (1)
openrag/utils/temporal.py (1)

116-140: LGTM!

The range builder methods correctly handle:

  • Full-day UTC-aligned ranges with microsecond precision
  • Month boundaries including December edge case
  • Year ranges
  • Relative time calculations with proper timezone handling

"jahr": 365, "jahre": 365,
}

self.relative_pattern = re.compile(r'(\d+)\s*(\w+)', re.IGNORECASE)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Validate the relative pattern against false positives.

The pattern r'(\d+)\s*(\w+)' is very broad and will match any number followed by any word. While the code later checks if the unit is in time_units, queries like "summarize 5 year plans" or "documents about 12 month contracts" will still extract temporal filters, potentially filtering out relevant results.

This addresses the concern raised in past review comments about false positives.

Consider adding word boundaries or requiring more context (e.g., "last", "past", "in the") before accepting a relative temporal expression:

#!/bin/bash
# Test queries that might produce false positives
python3 << 'EOF'
import sys
sys.path.insert(0, '.')
from openrag.utils.temporal import TemporalQueryNormalizer

normalizer = TemporalQueryNormalizer()

test_queries = [
    "summarize 5 year plans",
    "documents about 12 month contracts", 
    "7 eleven acquisition details",
    "last 5 years",  # should match
    "past 12 months",  # should match
]

for query in test_queries:
    result = normalizer.extract_temporal_filter(query)
    print(f"Query: {query!r}")
    print(f"  → {result}")
    print()
EOF
🤖 Prompt for AI Agents
In openrag/utils/temporal.py around line 65, the relative_pattern
r'(\d+)\s*(\w+)' is too permissive and yields false positives (e.g., "5 year
plans"); tighten the pattern to require word boundaries and a contextual prefix
like "(?:\b(?:last|past|in the)\b\s*)?" before the number or require the prefix
to be mandatory for matching, update the regex flags to be case-insensitive, and
ensure the code still validates the captured unit against time_units; adjust
extract_temporal_filter to only accept matches when the context prefix is
present (or add an explicit boolean flag to allow relaxed matching), and
add/update unit tests using the provided test queries to confirm false positives
are rejected and valid relative phrases are accepted.

Since `nullable=True` does not seem to work as expected. All values are set to `None` if missing.
Strings cannot be null in Milvus. Since empty strings are falsy values in Python, its a drop-in fix.
@EnjoyBacon7
Copy link
Collaborator Author

This PR doesn't meet quality requirements and is superseded by #259

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants